This assignment is due by 11:59pm Pacific Time on Friday, September 26th, 2025.
Learning Goals
Download, read, and get familiar with an external dataset.
Step through the EDA “checklist” presented in class
Practice making exploratory plots
Assignment Description
We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites that measure particulate matter (PM) concentrations. A primer on particulate matter air pollution can be found here.
The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) decreased in California over the 20 years spanning from 2002 to 2022.
Your assignment should be completed in Quarto and all code should be included.
Steps
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
(30 points) Given the formulated question from the assignment description, you will now conduct EDA Checklist items 1-5. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website, then read the data into R. For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check the distribution of the key variable we are analyzing (PM\(_{2.5}\)). Write up a summary of all of your findings.
Rows: 15976 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 59918 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spc_tbl_ [59,918 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Date : chr [1:59918] "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
$ Source : chr [1:59918] "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : chr [1:59918] "060010007" "060010007" "060010007" "060010007" ...
$ POC : num [1:59918] 3 3 3 3 3 3 3 3 3 3 ...
$ Daily Mean PM2.5 Concentration: num [1:59918] 12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
$ Units : chr [1:59918] "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : num [1:59918] 58 60 39 21 23 21 13 38 59 55 ...
$ Local Site Name : chr [1:59918] "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : num [1:59918] 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num [1:59918] 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : num [1:59918] 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr [1:59918] "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : num [1:59918] 170 170 170 170 170 170 170 170 170 170 ...
$ Method Description : chr [1:59918] "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
$ CBSA Code : num [1:59918] 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr [1:59918] "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : chr [1:59918] "06" "06" "06" "06" ...
$ State : chr [1:59918] "California" "California" "California" "California" ...
$ County FIPS Code : chr [1:59918] "001" "001" "001" "001" ...
$ County : chr [1:59918] "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num [1:59918] 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num [1:59918] -122 -122 -122 -122 -122 ...
- attr(*, "spec")=
.. cols(
.. Date = col_character(),
.. Source = col_character(),
.. `Site ID` = col_character(),
.. POC = col_double(),
.. `Daily Mean PM2.5 Concentration` = col_double(),
.. Units = col_character(),
.. `Daily AQI Value` = col_double(),
.. `Local Site Name` = col_character(),
.. `Daily Obs Count` = col_double(),
.. `Percent Complete` = col_double(),
.. `AQS Parameter Code` = col_double(),
.. `AQS Parameter Description` = col_character(),
.. `Method Code` = col_double(),
.. `Method Description` = col_character(),
.. `CBSA Code` = col_double(),
.. `CBSA Name` = col_character(),
.. `State FIPS Code` = col_character(),
.. State = col_character(),
.. `County FIPS Code` = col_character(),
.. County = col_character(),
.. `Site Latitude` = col_double(),
.. `Site Longitude` = col_double()
.. )
- attr(*, "problems")=<externalptr>
First, the monitoring record expanded substantially. The 2022 dataset is about 3.75 times larger than the 2002 dataset (59,918 daily observations versus 15,976). This indicates an increase in the number of monitoring sites, the frequency of reporting, or both, and gives more comprehensive spatial and temporal coverage across the state. Furthermore, the measurement method listed in the data changed from the older filter-based “Andersen RAAS2.5-300” in 2002 to the continuous “Met One BAM-1020” monitor in 2022, which allows for real-time data collection.
Most significantly, a first look at the data suggests improved air quality. The initial observations show daily PM2.5 concentrations of roughly 20-40 µg/m³ in 2002 compared with roughly 2-14 µg/m³ in 2022. These are only a handful of readings, so they are suggestive rather than conclusive, but they point to a substantial decrease in fine particulate matter pollution over the two decades, possibly resulting from air quality regulations and policies.
In conclusion, the comparison between 2002 and 2022 reveals a positive story: California’s capacity to monitor air pollution grew immensely through a larger network and better technology, while the data itself indicates a successful reduction in harmful PM2.5 pollution across the state.
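The code that produced the checks in this step is not shown in the rendered output. A minimal sketch of the checklist items (dimensions, headers, footers, variable names and types, and the distribution of the key variable) could look like the following. It uses base R and a tiny inline stand-in for the download; the object name `ca_toy` and the four-column layout are assumptions made for illustration, and the real files would be read from a local path instead.

```r
# Toy stand-in for one year's EPA download (four rows copied from the
# str() output above; the real file has 22 columns).
csv_text <- 'Date,Site ID,Daily Mean PM2.5 Concentration,County
01/01/2022,060010007,12.7,Alameda
01/02/2022,060010007,13.9,Alameda
01/03/2022,060010007,7.1,Alameda
01/04/2022,060010007,3.7,Alameda'

ca_toy <- read.csv(
  text = csv_text,
  check.names = FALSE,  # keep the original column names, spaces included
  # Read Site ID as character so the leading zero is preserved
  colClasses = c("character", "character", "numeric", "character")
)

dim(ca_toy)      # dimensions
head(ca_toy, 2)  # header rows
tail(ca_toy, 2)  # footer rows
str(ca_toy)      # variable names and types
summary(ca_toy[["Daily Mean PM2.5 Concentration"]])  # key variable
```

Reading the site identifier as text matters because a numeric parse would turn "060010007" into 60010007 and break later joins on site.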
(10 points) Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
combined_CAdata <- combined_CAdata |>
  rename(
    PM25        = `Daily Mean PM2.5 Concentration`,
    AQI         = `Daily AQI Value`,
    Site_Name   = `Local Site Name`,
    County_FIPS = `County FIPS Code`,
    State_FIPS  = `State FIPS Code`
  )
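Only the renaming is shown above. The earlier parts of this step, stacking the two years and deriving a year column from Date, could be sketched in base R as follows. The data frames `ca_2002_toy` and `ca_2022_toy` and the 2002 readings are hypothetical; the date format `%m/%d/%Y` is taken from the `str()` output in step 1.

```r
# Hypothetical two-row extracts standing in for the 2002 and 2022 files.
ca_2002_toy <- data.frame(
  Date = c("01/05/2002", "01/06/2002"),
  `Daily Mean PM2.5 Concentration` = c(25.1, 31.7),
  check.names = FALSE
)
ca_2022_toy <- data.frame(
  Date = c("01/01/2022", "01/02/2022"),
  `Daily Mean PM2.5 Concentration` = c(12.7, 13.9),
  check.names = FALSE
)

# Stack the years, then parse the month/day/year strings into Date objects.
combined_toy <- rbind(ca_2002_toy, ca_2022_toy)
combined_toy$Date <- as.Date(combined_toy$Date, format = "%m/%d/%Y")

# Derive the year identifier from the parsed date.
combined_toy$Year <- as.integer(format(combined_toy$Date, "%Y"))

# Give the key variable a shorter name.
old_name <- "Daily Mean PM2.5 Concentration"
names(combined_toy)[names(combined_toy) == old_name] <- "PM25"

table(combined_toy$Year)
```

Parsing Date before extracting the year is a little safer than taking the last four characters of the string, because a malformed date becomes `NA` instead of passing through silently.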
(20 points) Create a basic map (or maps) in leaflet showing the locations of the monitoring sites, using different colors for each year. Summarize the spatial distribution of the sites. Does this distribution change from 2002 to 2022?
library(leaflet)

# One row per monitoring site in each year
sites_2002 <- ca_2002 |>
  distinct(`Site ID`, .keep_all = TRUE) |>
  mutate(Year = "2002")
sites_2022 <- ca_2022 |>
  distinct(`Site ID`, .keep_all = TRUE) |>
  mutate(Year = "2022")

site_pal <- colorFactor(palette = c("red", "blue"), domain = c("2002", "2022"))

# Map for 2002
map_2002 <- leaflet(sites_2002) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircles(
    lat = ~`Site Latitude`, lng = ~`Site Longitude`,
    label = ~paste0(`Local Site Name`),
    color = ~site_pal("2002"),  # Force color to red for 2002
    opacity = 1, fillOpacity = 1, radius = 500
  ) |>
  addLegend("bottomleft", colors = "red", labels = "2002 Sites",
            title = "Monitoring Year 2002", opacity = 1) |>
  addControl("2002 Monitoring Sites", position = "topright")

# Map for 2022
map_2022 <- leaflet(sites_2022) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircles(
    lat = ~`Site Latitude`, lng = ~`Site Longitude`,
    label = ~paste0(`Local Site Name`),
    color = ~site_pal("2022"),  # Force color to blue for 2022
    opacity = 1, fillOpacity = 1, radius = 500
  ) |>
  addLegend("bottomleft", colors = "blue", labels = "2022 Sites",
            title = "Monitoring Year 2022", opacity = 1) |>
  addControl("2022 Monitoring Sites", position = "topright")

map_2002
map_2022
Based on the maps, the spatial distribution of PM2.5 monitoring sites in California changed dramatically from 2002 to 2022. In 2002 the network was sparse and concentrated primarily in the major coastal metropolitan areas of Los Angeles and the San Francisco Bay Area. By 2022 it had become a dense statewide system that extensively covers the interior Central Valley, agricultural regions, and smaller population centers. This expansion may reflect an effort to address environmental justice concerns and to capture pollution hotspots across the entire state.
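The visual comparison can be backed by a simple count of which sites appear in each year. The sketch below uses base R set operations on site identifiers. The two ID vectors are a small hypothetical subset chosen for illustration; in the assignment they would come from the `Site ID` column of each year's data.

```r
# Hypothetical site IDs for each year (illustration only).
ids_2002 <- c("060370002", "060371002", "060371103", "060379033")
ids_2022 <- c("060370002", "060370016", "060371103", "060379033", "060374004")

both_years <- intersect(ids_2002, ids_2022)  # monitored in both years
only_2002  <- setdiff(ids_2002, ids_2022)    # not reporting in 2022
only_2022  <- setdiff(ids_2022, ids_2002)    # not reporting in 2002

site_counts <- c(
  n_2002    = length(unique(ids_2002)),
  n_2022    = length(unique(ids_2022)),
  both      = length(both_years),
  only_2002 = length(only_2002),
  only_2022 = length(only_2022)
)
site_counts
```

Counts like these separate genuine network growth (many sites only in 2022) from simple turnover (similar numbers of sites dropped and added).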
(10 points) Check for any data issues such as missing or implausible values of PM\(_{2.5}\) in the combined dataset. Calculate the proportion of missing/implausible values for each year and report any temporal patterns you see in these observations.
The analysis reveals excellent data quality for both years, with very low rates of missing or implausible values. In 2002, only 0.019% of observations (3 out of 15,976) were implausible (≤ 0 μg/m³), while in 2022, 0.57% of observations (343 out of 59,918) were implausible. No missing values or values exceeding 500 μg/m³ were detected in either year. The slight increase in low-level implausible values in 2022 may reflect more sensitive detection of very low concentrations by modern instrumentation. Overall, both datasets show exceptionally high data quality with >99.4% valid observations.
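The proportions reported above could be computed along the following lines. This is a base R sketch on hypothetical toy data; the thresholds (non-positive values and values above 500 μg/m³) follow the definition used in the paragraph above, and the column names `is_missing` and `is_implausible` are introduced here for illustration. The last line rechecks the reported percentages from the counts given in the text.

```r
# Toy combined data (hypothetical values): one missing reading and two
# non-positive readings.
toy <- data.frame(
  Date = c("01/05/2002", "02/10/2002", "07/04/2002", "11/20/2002",
           "01/01/2022", "01/15/2022", "06/30/2022", "08/12/2022",
           "12/25/2022"),
  PM25 = c(18.2, 0, 25.4, NA, 6.1, -1.3, 4.8, 9.9, 12.0)
)
toy$Date  <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$Year  <- format(toy$Date, "%Y")
toy$Month <- format(toy$Date, "%m")

# Flag missing and implausible readings separately.
toy$is_missing     <- is.na(toy$PM25)
toy$is_implausible <- !toy$is_missing & (toy$PM25 <= 0 | toy$PM25 > 500)

# The mean of a logical vector is the proportion of TRUE values.
missing_rate     <- tapply(toy$is_missing, toy$Year, mean)
implausible_rate <- tapply(toy$is_implausible, toy$Year, mean)

# Tally flagged rows by year and month to look for temporal patterns.
flagged <- toy[toy$is_missing | toy$is_implausible, ]
table(flagged$Year, flagged$Month)

# Recheck the reported percentages: 3 of 15,976 and 343 of 59,918.
reported_pct <- round(100 * c(3, 343) / c(15976, 59918), 3)
reported_pct  # 0.019 0.572
```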
(30 points) Explore the main question of interest at three different levels of spatial resolution. Create data visualizations (e.g. boxplots, histograms, line plots, violin plots) and summary statistics that best suit each level of the analysis. Be sure to write up explanations of what you observe at each level.
Level 1: State. Examine the primary question for the entire state.
library(ggplot2)

ggplot(combined_CAdata, aes(x = factor(Year), y = PM25)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Statewide PM2.5 Distribution",
       x = "Year", y = "PM2.5 (μg/m³)") +
  theme_minimal()
At the statewide level, air quality improved substantially from 2002 to 2022. The mean PM₂.₅ concentration decreased by 48%, falling from 16.12 μg/m³ to 8.41 μg/m³. The median shows a similar pattern, decreasing from 12.0 μg/m³ to 6.8 μg/m³ (a 43% reduction).
The standard deviation also decreased substantially, from 13.87 to 7.64 μg/m³, indicating that not only did the average pollution level drop, but daily concentrations also became less variable. The combination of these findings (a lower mean and median, and a smaller spread) suggests that extreme pollution events became less frequent over this twenty-year period.
This dramatic improvement likely reflects the cumulative impact of statewide air quality regulations, technological advancements in emissions control, and changes in industrial and transportation practices across California.
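The statewide summary statistics and the percent changes quoted above can be obtained with a few lines of base R. The sketch below uses hypothetical toy values so the arithmetic is easy to follow by hand; `pct_change` is a helper defined here for illustration, not a function from any package.

```r
# Toy data (hypothetical values): four readings per year.
toy <- data.frame(
  Year = rep(c("2002", "2022"), each = 4),
  PM25 = c(20, 10, 30, 20, 8, 12, 10, 10)
)

# Mean, median and standard deviation for each year (one column per year).
stats_by_year <- sapply(
  split(toy$PM25, toy$Year),
  function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)
stats_by_year

# Percent change relative to the earlier value; negative means a decrease.
pct_change <- function(old, new) 100 * (new - old) / old

mean_change   <- pct_change(stats_by_year["mean", "2002"],
                            stats_by_year["mean", "2022"])
median_change <- pct_change(stats_by_year["median", "2002"],
                            stats_by_year["median", "2022"])
c(mean = mean_change, median = median_change)
```

Dividing by the 2002 value expresses the change relative to the starting level, which is the usual convention for a statement such as "decreased by X%".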
Level 2: County. Examine the primary question for every county in California.
ggplot(county_change, aes(x = Change)) +
  geom_density(fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "County-Level PM2.5 Improvements",
       subtitle = "51 California counties, 2002-2022",
       x = "Change in Mean PM2.5 (μg/m³)", y = "Density") +
  theme_minimal()
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_density()`).
The analysis reveals a more nuanced pattern than the statewide summary suggests. Of the 51 counties in the plot data, 47 have a computable change; the other 4 are dropped, as the warning above indicates. Among those 47, 41 showed lower mean PM₂.₅ in 2022 than in 2002, while 6 showed higher levels. The average change across counties was -4.55 μg/m³, but this masks significant variation: county-level changes range from a decrease of 13.24 μg/m³ to an increase of 7.94 μg/m³.
The density plot shows a bimodal distribution with two peaks: one around -7 μg/m³ (representing counties with strong improvements) and another closer to 0 μg/m³. This suggests that air quality improvements were not uniform across California. While most counties benefited from statewide regulations and technological advances, certain counties experienced worsening air quality, possibly due to local factors such as increased industrial activity, population growth, wildfires, or changes in transportation patterns that offset broader improvements.
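The construction of `county_change` is not shown above. One way to build a table of that shape is to compute the mean for every county and year, place the years side by side, and subtract. The base R sketch below does this on hypothetical toy data; the county labels and values are invented, and the result is stored under a different name (`county_change_toy`) to make clear it is not the author's object.

```r
# Toy data (hypothetical): county C has no 2002 measurements.
toy <- data.frame(
  County = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  Year   = c("2002", "2002", "2022", "2022",
             "2002", "2002", "2022", "2022", "2022", "2022"),
  PM25   = c(20, 22, 10, 12, 8, 10, 12, 14, 9, 11)
)

# County-by-year matrix of means; empty cells become NA.
county_means <- tapply(toy$PM25, list(toy$County, toy$Year), mean)

county_change_toy <- data.frame(
  County = rownames(county_means),
  Change = unname(county_means[, "2022"] - county_means[, "2002"])
)
county_change_toy

# Counties with data in both years, and how many improved or worsened.
n_compared <- sum(!is.na(county_change_toy$Change))
n_improved <- sum(county_change_toy$Change < 0, na.rm = TRUE)
n_worsened <- sum(county_change_toy$Change > 0, na.rm = TRUE)
c(compared = n_compared, improved = n_improved, worsened = n_worsened)
```

Counties measured in only one year get an `NA` change, which is the kind of row a density plot drops with a "non-finite" warning.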
Level 3: City. Restrict the data to sites in Los Angeles county and examine the primary question for every site.
# A tibble: 18 × 5
`Site ID` Site_Name `2002` `2022` Change
<chr> <chr> <dbl> <dbl> <dbl>
1 060370002 Azusa 20.8 9.72 -11.0
2 060370016 Glendora NA 8.42 NA
3 060371002 Burbank 24.0 NA NA
4 060371103 Los Angeles-North Main Street 22.0 11.6 -10.4
5 060371201 Reseda 18.9 10.7 -8.14
6 060371301 Lynwood 23.3 NA NA
7 060371302 Compton NA 13.0 NA
8 060371601 <NA> 23.9 NA NA
9 060371602 Pico Rivera #2 NA 11.4 NA
10 060372005 Pasadena 20.3 9.09 -11.2
11 060374002 Long Beach (North) 19.5 9.92 -9.55
12 060374004 Long Beach (South) NA 12.0 NA
13 060374008 Long Beach-Route 710 Near Road NA 13.0 NA
14 060374009 Signal Hill (LBSH) NA 8.85 NA
15 060374010 North Hollywood (NOHO) NA 13.0 NA
16 060376012 Santa Clarita NA 9.14 NA
17 060379033 Lancaster-Division Street 10.4 7.52 -2.86
18 060379034 Lebec 4.82 3.50 -1.32
ggplot(la_site_summary, aes(x = reorder(Site_Name, Change), y = Change)) +
  geom_col(fill = ifelse(la_site_summary$Change < 0, "steelblue", "salmon")) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "LA County Site PM2.5 Changes",
       subtitle = "2002 to 2022",
       x = "Monitoring Site",
       y = "Change in PM2.5 (μg/m³)") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8))
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_col()`).
The analysis of Los Angeles County monitoring sites reveals consistent air quality improvement across the region: all 7 sites that have data for both years show lower PM₂.₅ levels in 2022 than in 2002. The improvements were substantial, with an average change of -7.79 μg/m³ across these sites. However, data completeness is limited. Only 7 of 18 sites have measurements for both years, while 11 sites are missing either 2002 or 2022. Among sites with complete data, changes ranged from -1.32 μg/m³ in Lebec to -11.20 μg/m³ in Pasadena, indicating that while all of these sites improved, the magnitude of improvement varied considerably by location. This pattern suggests that regional air quality policies were broadly effective throughout Los Angeles County, though local factors influenced the degree of improvement achieved at each site.
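The counts and the average quoted above can be rechecked directly from the site table. The sketch below re-enters the displayed 2002 and 2022 site means in table order. Because the tibble prints only three significant figures, the recomputed average is expected to agree with the reported -7.79 μg/m³ only up to rounding. The vector names `y2002` and `y2022` are introduced here for illustration.

```r
# Site means as displayed in the table above, in row order (NA = no data).
y2002 <- c(20.8, NA, 24.0, 22.0, 18.9, 23.3, NA, 23.9, NA,
           20.3, 19.5, NA, NA, NA, NA, NA, 10.4, 4.82)
y2022 <- c(9.72, 8.42, NA, 11.6, 10.7, NA, 13.0, NA, 11.4,
           9.09, 9.92, 12.0, 13.0, 8.85, 13.0, 9.14, 7.52, 3.50)

change   <- y2022 - y2002    # NA whenever either year is missing
complete <- !is.na(change)

n_complete   <- sum(complete)              # sites with both years
n_incomplete <- sum(!complete)             # sites missing one year
all_improved <- all(change[complete] < 0)  # did every complete site improve?
mean_change  <- mean(change[complete])     # average change, complete sites

c(complete = n_complete, incomplete = n_incomplete)
all_improved
round(mean_change, 2)
```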
Reminder: after you upload your final rendered document to GitHub, you should download it to make sure that it looks right! If you haven’t included embed-resources: true in the YAML header, none of your figures will show up!
Another reminder: GitHub is not (generally) intended for sharing data, so if you upload the dataset to your GitHub repo, you will lose 5 points. You can avoid this problem by storing the data somewhere else on your local machine (outside of your repo), or by adding the data file to your .gitignore file.